A HYBRID SCHEME FOR ENCODING AUDIO SIGNAL USING HIDDEN MARKOV MODELS OF WAVEFORMS

S. MOLLA AND B. TORRESANI 

Abstract. This paper reports on recent results related to audiophonic signals encoding using
time-scale and time-frequency transform. More precisely, non-linear, structured approximations
for tonal and transient components using local cosine and wavelet bases will be described,
yielding expansions of audio signals in the form tonal + transient + residual. We describe a
general formulation involving hidden Markov models, together with corresponding rate estimates.
Estimators for the balance transient/tonal are also discussed.

1. Introduction: structured hybrid models

Recent signal processing studies have shown the importance of sparse representations for
various tasks, including signal and image compression (obviously), de-noising, signal
identification/detection,... Such sparse representations are generally achieved using suitable
orthonormal bases of the considered signal space. However, recent developments also indicate
that redundant systems, such as frames, or more general "waveform dictionaries" may yield
substantial gains in this context, provided that they are sufficiently adapted to the
signal/image to be described.

From a different point of view, it has also been shown by several authors that in a signal or 
image compression context, significant improvements may be achieved by introducing structured 
approximation schemes, namely schemes in which structured sets of coefficients are considered 
rather than isolated ones. 

The goal of this paper is to describe a new approach that implements both ideas, via a
hybrid model involving sparse, structured, random wavelet/MDCT^1 expansions, where the sets
of considered coefficients (the significance maps) are described via suitable (hidden) Markov
models.



^1 MDCT: Modified Discrete Cosine Transform.
This work is mainly motivated by audio coding applications, to which we come back after
describing the models and corresponding estimation algorithms. However, similar ideas may 
clearly be developed in different contexts, including image [20] and image sequence coding, 
where both ingredients (hybrid and structured models) have already been exploited. 



1.1. Generalities, sparse expansions in redundant systems. Very often, signals turn out
to be made of several components, of significantly different nature. This is the case for
"natural images", which may contain edge information, regular textures, and "non-stationary"
textures (which carry 3D information.) This is also the case for audio signals, which among
other features, contain transient and tonal components [7], on which we shall focus more deeply.
It is known that such different features may be represented efficiently in specific orthonormal
bases. Following the philosophy of transform coding, this suggests considering redundant systems
made by concatenating several families of bases. Such systems have been considered for example
in [SmSlIll], where the problem of selecting the "sparsest" expansion through linear programing 
has been considered. 

Focusing on the particular application to audio signals, and limiting ourselves to transient
and tonal features, we are naturally led to consider a generic redundant dictionary made out of
two orthonormal bases, denoted by $\psi_\lambda$ and $w_\delta$ respectively (typically a wavelet
and an MDCT basis), and signal expansions of the form

(1) $x = \sum_{\lambda\in\Lambda} \alpha_\lambda \psi_\lambda + \sum_{\delta\in\Delta} \beta_\delta w_\delta + r$ ,

where $\Lambda$ and $\Delta$ are (small, and this will be the main sparsity assumption) subsets of
the index sets, hereafter termed significance maps. The nonzero coefficients $\alpha_\lambda$ are
independent $\mathcal N(0, \sigma_\lambda^2)$ random variables, and the nonzero coefficients
$\beta_\delta$ are independent $\mathcal N(0, \tilde\sigma_\delta^2)$ random variables; $r$
is a residual signal, which is not sparse with respect to the two considered bases (we shall talk
of spread residual), and is to be neglected or described differently.
The approach developed in [HI [121 [S] may be criticized in several respects when it comes to
practical implementation in a coding perspective. On one hand, it is not clear that the
corresponding linear programing algorithms are compatible with practical constraints, in terms of
CPU and memory requirements^2. Also, models exploiting solely sparsity arguments cannot capture
one of the main features of some signal classes, namely the persistence property: significant
coefficients have a tendency to form "clusters", or "structured sets". For example, in an audio
coding context, the significance maps take the form of ridges (i.e. "time-persistent" sets, see
e.g. [21 E] in a different context) for the MDCT map $\Delta$, and binary trees for the wavelet
map $\Lambda$. This remark has been exploited in various instances, for example in the context of
the sinusoidal models for speech [19], or for image coding [H [5] [25] [26].

Several models may be considered for the $\Lambda$ and $\Delta$ sets (termed significance maps),
with variable levels of complexity. If only sparsity is used, they may be chosen uniformly
distributed (in a finite dimensional context.) We shall rather work in a more complex context,
and use (hidden) Markov chains to describe the MDCT ridges in $\Delta$ (in the spirit of the
sinusoidal models of speech), and (hidden) binary Markov trees for the wavelet map $\Lambda$,
following [5]. This not only yields a better modeling of the features of the signal, but also
provides corresponding estimation algorithms.

To be more specific, a tonal signal is modeled as

$x_{ton} = \sum_{\delta\in\Delta} \beta_\delta w_\delta$ ,

the functions $w_\delta$ being local cosine functions. The (significant) coefficients
$\beta_\delta$, $\delta\in\Delta$ are $\mathcal N(0, \tilde\sigma_\delta^2)$ independent random
variables. The index $\delta$ is in fact a pair of time-frequency indices $\delta = (k, n)$,
and the significance map $\Delta$ is characterized by a "fixed frequency" Markov chain (see e.g.
[16] for a simple account), hence by a set of initial frequencies $\nu_1, \dots, \nu_N$ and
transition matrices $P_1, \dots, P_N$ (one for each frequency bin).



^2 For example, for audio signals typically sampled at 44.1 kHz.
Globally, the tonal model is characterized by the set of matrices $P_n$, and the variances of
the two states, which are assumed to be time invariant, and on which additional constraints
may be imposed. The tonal model is described in some detail in Section 2.

A similar model, using hidden Markov trees of wavelet coefficients [5], may be developed to
describe the transient layer in the signal:

$x_{tr} = \sum_{\lambda\in\Lambda} \alpha_\lambda \psi_\lambda$ ,

$\psi$ being a wavelet with good time localization. The rationale is now to model the scale
persistence of large wavelet coefficients of the transients, exploiting the intrinsic dyadic tree
structure of wavelet coefficients (see Figure 5 below.) Again, the significant wavelet
coefficients $\{\alpha_\lambda, \lambda\in\Lambda\}$ of the signal are modeled as independent
$\mathcal N(0, \sigma_\lambda^2)$ random variables. The index $\lambda$ is in fact a pair of
scale-time indices $\lambda = (j, k)$, and the significance map $\Lambda$ is characterized by a
"fixed time" Markov chain, hence by corresponding "scale to scale" transition matrices $P_j$
(with additional constraints which ensure that significant coefficients inherit a natural tree
structure, see below.) The transient model is therefore characterized by the variances of wavelet
coefficients in $\Lambda$ and $\Lambda^c$, and the persistence probabilities, for which estimators
may be constructed. The transient states estimation itself is also performed via classical
methods. These aspects are described in Section 3.

1.2. Recursive estimation. Several approaches are possible to estimate the significance maps
and corresponding coefficients in models such as (1), ranging from the above mentioned linear
programing schemes (see for example [3]) to greedy algorithms, including for instance Matching
pursuit [13], dB]. The procedure we use is in some sense intermediate between these two extremes,
in the spirit of the techniques used in [I]. We consider a dictionary made of two (orthonormal)
bases; a first layer is estimated, using the first basis, and a second layer is estimated from the
residual, using the second basis. The main difficulty of such an approach lies in the fact that
the number of significant elements from the first basis has to be known in advance (or at least
estimated.) In other terms, the cardinalities $|\Lambda|$ and $|\Delta|$ of the significance maps
have to be known. This is important, since an underestimation or overestimation of $|\Delta|$
(assuming that the $\Delta$-layer is estimated first) will "propagate" to the estimation of the
second layer (the $\Lambda$-layer.) In the framework of the Gaussian random sparse models studied
below, it is possible to derive a priori estimates for the cardinalities $|\Lambda|$ and
$|\Delta|$, using information measures in the spirit of those proposed in [30] and studied in
[27]. Consider the geometric means of estimated
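$\psi_\lambda$ and $w_\delta$ coefficients

(2) $\hat N_\psi = \left(\prod_{n=1}^{N} |\langle x, \psi_n\rangle|^2\right)^{1/N}$ and $\hat N_w = \left(\prod_{n=1}^{N} |\langle x, w_n\rangle|^2\right)^{1/N}$ .

Then, assuming sparsity, the indices

(3) [expressions in $\hat N_\psi$ and $\hat N_w$; not legible in this copy]
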

turn out to provide estimates for the proportion of significant $w$ and $\psi$ coefficients. The
rationale is the fact that under sparsity assumptions (i.e. if $\Lambda$ and $\Delta$ are small
enough), most coefficients $\langle x, \psi_n\rangle$ (resp. $\langle x, w_n\rangle$) will come
from the tonal (resp. transient) layer of the signal, and therefore give information about it.
This aspect is discussed in more detail in a later section.
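
As an illustration of the geometric means in (2), here is a minimal Python sketch. The function name `geometric_mean_energy`, the use of squared moduli, and the small regularization constant `eps` (which avoids taking the logarithm of zero) are assumptions made for this example, not part of the original text. The product is evaluated in the log domain for numerical stability.

```python
import numpy as np

def geometric_mean_energy(coeffs, eps=1e-12):
    """Geometric mean of squared coefficient magnitudes, computed via logarithms."""
    c2 = np.abs(np.asarray(coeffs, dtype=float)) ** 2
    # exp(mean(log(.))) equals the N-th root of the product, without overflow/underflow.
    return float(np.exp(np.mean(np.log(c2 + eps))))

# Hypothetical toy data: a sparse coefficient vector and a spread one with the same energy.
sparse = np.array([10.0, 0.01, 0.01, 0.01])
spread = np.full(4, np.sqrt(np.sum(sparse ** 2) / 4))

print("sparse:", geometric_mean_energy(sparse))
print("spread:", geometric_mean_energy(spread))
```

At equal energy, the sparse vector yields a much smaller geometric mean than the spread one, which is the behaviour that makes such quantities usable as sparsity indicators.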

1.3. Audio coding applications. As mentioned earlier, the primary motivation for this work 
was audio coding. We briefly sketch here the assets of the model we are developing in such a 
context. 

Coding involves (lossy) quantization of the selected coefficients
$\{\langle x, w_\delta\rangle, \delta\in\Delta\}$ and $\{\langle x, \psi_\lambda\rangle, \lambda\in\Lambda\}$.
These are Gaussian random variables, which means that corresponding rate and distortion
estimates may be obtained. We notice that the introduction of structured significance maps
does not improve the quality of the approximation (as measured by $L^2$ distortion); however it
improves the efficiency of significance map encoding (see below); in addition, for audio
applications, since structured significance maps seem to be relevant in the sense that they
describe more accurately elementary "sound objects", they often yield better approximations of
audio signals, from perceptual points of view.
Besides coefficients, significance maps have to be encoded as well. However, the Markov
models make it possible to compute explicitly the probabilities of ridge lengths (for $\Delta$)
and tree sizes (for $\Lambda$), which allows one to obtain directly the corresponding optimal
lossless code. Again, rate estimates may be derived explicitly.

It is also worth pointing out some important issues (in a coding perspective), which we shall
not address here. The first one is the encoding of the residual signal. It was suggested in [7]
that the residual may be encoded using standard LPC techniques^3. However, it appears that in
most situations (at least for large enough bit rates), encoding the residual is not necessary,
the transient and tonal layers providing a satisfactory description of the signal.

A second point is related to the implementation of perceptual arguments (e.g. masking):
the goal is not really to obtain a lossy description of the signal with a small distortion: the
distortion is rather expected to be inaudible, which has little to do with its $L^2$ norm. In the
proposed scheme, this aspect will be addressed at the level of coefficient quantization (as in
most perceptual coders.) However let us point out that the "structural decomposition" involving
well defined tonal and transient layers shall make it possible to implement separately frequency
masking on the tonal layer, and time masking on the transient layer, which is a completely
original approach. This work (in progress) will be partly reported in [6].

2. Structured Markov model for tonal 

We start with a description of the first layer of the model. We make use of the local cosine
bases constructed by Coifman and Meyer. Let us briefly recall here the construction, in the case
we shall be interested in here. Let $\ell\in\mathbb R^+$ and $\eta\in\mathbb R^+$, $\eta\le\ell/2$. Let $w$ be a smooth function



^3 LPC: Linear Predictive Coding.
(called the basic window) satisfying the following properties:

(4) $\mathrm{supp}(w) \subset [-\eta, \ell+\eta]$

(5) $w(-\tau) = w(\tau)$ for all $|\tau| \le \eta$

(6) $w(\ell-\tau) = w(\ell+\tau)$ for all $|\tau| \le \eta$

(7) $\sum_k w(t-k\ell)^2 = 1$, $\forall t$ ,

and set

(8) $w_{kn}(t) = \sqrt{\frac{2}{\ell}}\; w(t-k\ell)\, \cos\left(\frac{\pi(n+1/2)}{\ell}(t-k\ell)\right)$ , $n\in\mathbb Z^+$, $k\in\mathbb Z$ .

Then it may be proved that the collection of such functions, when $n$ spans $\mathbb Z^+$ and $k$
spans $\mathbb Z$, forms an orthonormal basis of $L^2(\mathbb R)$. Versions adapted to spaces of
functions with bounded support, as well as discrete versions, may also be obtained easily. We
refer to [30] for a detailed account of such constructions. The classical choice for such
functions amounts to taking an arc of sine wave as function $w$. We shall limit ourselves to the
so-called "maximally smooth" windows, by setting $\eta = \ell/2$.
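
The normalization property (7) can be checked numerically for an arc-of-sine window with $\eta = \ell/2$. The sketch below is illustrative only: the explicit window formula $w(t) = \sin(\pi(t+\ell/2)/(2\ell))$ on $[-\ell/2, 3\ell/2]$ and the function names are assumptions made for this example.

```python
import numpy as np

def sine_window(t, ell):
    """Arc-of-sine window supported on [-ell/2, 3*ell/2], i.e. eta = ell/2 (assumed form)."""
    t = np.asarray(t, dtype=float)
    inside = (t >= -ell / 2) & (t <= 3 * ell / 2)
    return np.where(inside, np.sin(np.pi * (t + ell / 2) / (2 * ell)), 0.0)

def squared_window_sum(t, ell, k_range):
    """Left-hand side of (7): sum over k of w(t - k*ell)^2."""
    return sum(sine_window(t - k * ell, ell) ** 2 for k in k_range)

ell = 1.0
t = np.linspace(0.0, 5.0, 1001)
# k from -2 to 7 covers every window overlapping the interval [0, 5].
total = squared_window_sum(t, ell, range(-2, 8))
print("max deviation from 1:", np.max(np.abs(total - 1.0)))
```

At each time exactly two shifted windows overlap, and their squares add up to $\sin^2 + \cos^2 = 1$.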

In the framework of the recursive estimation scheme we are about to describe, the simplest
(and natural) idea would be to start by expanding the signal with respect to a local cosine basis,
and pick the largest coefficients (in absolute value, after appropriate weighting if needed) to
form a best $N$-term approximation [7]. However, as may be seen in the middle image of Figure 1,
such a strategy would automatically "capture" local cosine coefficients which definitely belong
to transients (i.e. seem to form localized, "vertical" structures.) In order to avoid capturing
such undesired coefficients, it is also natural to use the "structure" of MDCT coefficients of
tonals, i.e. the fact that they have a tendency to form "horizontal ridges". This is the purpose
of the tonal model described below. In the glockenspiel example of Figure 1, such a strategy
produces a tonal layer whose MDCT is exhibited in the bottom image, from which it is easily
seen that only "horizontally structured" coefficients have been retained.
Figure 1. Estimating a tonal layer; top: glockenspiel signal; middle: logarithm
of absolute value of MDCT coefficients of the signal; bottom: logarithm of absolute
value of MDCT coefficients of a tonal layer, estimated using "horizontal"
structures in MDCT coefficients.

2.1. Model and consequences. In the framework of the recursive approach, the signal is
modeled as a structured harmonic mixture of Gaussians, i.e. expanded into an MDCT basis,
with given cutoff frequency $N$:

(9) $x = \sum_{n=0}^{N-1} \sum_k Y_{kn} w_{kn}$ ,

where the coefficients of the expansion are (real, continuous) random variables $Y_{kn}$ whose
distribution is governed by a family of "fixed frequency" Hidden Markov chains (HMC) $X_{kn}$,
$k = 1, \dots$. According to the usual practice, we shall denote by $Y_{1:k,n}$ (resp.
$X_{1:k,n}$) the random vector $(Y_{1n}, \dots, Y_{kn})$ (resp. $(X_{1n}, \dots, X_{kn})$), and use
a similar notation for the corresponding values $(y_{1n}, \dots, y_{kn})$ (resp.
$(x_{1n}, \dots, x_{kn})$.) $\rho_{Y_{1:k,n}}$ and $\rho_{Y_{kn}}$ will denote the joint density
of $Y_{1:k,n}$ and the density of $Y_{kn}$ respectively, and the density of $Y_{kn}$ conditioned
by $X_{kn}$, assumed to be independent of $k$, will be denoted by

$\psi_n(y|x) = \rho_{Y_{kn}}(y \,|\, X_{kn} = x)$ , $x = T, R$ .



To be more precise, the model is characterized as follows: 
i. For all $n$, $X_{\cdot n}$ is a Markov chain with state space

$\mathcal X = \{T, R\}$

("tonal" and "residual", or non-tonal) and transition matrix $P_n$, of the form

$P_n = \begin{pmatrix} \pi_n & 1-\pi_n \\ 1-\pi'_n & \pi'_n \end{pmatrix}$ ,

the numbers $\pi_n$, $\pi'_n$ being the persistence probabilities of the tonal and residual
states: for all $n$

(10) $\pi_n = P\{X_{kn} = T \,|\, X_{k-1,n} = T\}$ ,

(11) $\pi'_n = P\{X_{kn} = R \,|\, X_{k-1,n} = R\}$ .

The initial frequencies of T and R states will be denoted by $\nu_n$ and $1-\nu_n$ respectively.
For the sake of simplicity, we shall generally assume that the initial frequencies coincide
with the equilibrium frequencies of the chain:

$\nu_n = \nu_n^{(e)} = \frac{1-\pi'_n}{2-\pi_n-\pi'_n}$ .
ii. The (emitted) coefficients $Y_{kn}$ are continuous random variables, with densities denoted
by $\rho_{Y_{1:k,n}}(y_{1:k,n})$.

iii. The distribution of the (emitted) coefficients $Y_{kn}$ depends only on the corresponding
hidden state $X_{kn}$; for each $n$, the coefficients $Y_{kn}$ are independent conditional to the
hidden states, and their distribution does not depend on the time index $k$ (but does
depend on the frequency index $n$.) We therefore denote

$\rho_{Y_{1:k,n}}(y_{1:k,n} \,|\, X_{1:k,n} = x_{1:k,n}) = \prod_{i=1}^{k} \psi_n(y_{in}|x_{in})$ .



iv. In order to model audio signals, we shall limit ourselves to centered Gaussian models for
the densities $\psi_n$. The latter are therefore completely determined by their variances: a
large variance $\sigma_{T,n}^2$ for the T type coefficients, and a small variance
$\sigma_{R,n}^2$ for the R type coefficients.

Therefore, the model is completely characterized by the parameter set

$\theta = \{\pi_n, \pi'_n;\ \nu_n;\ \sigma_{T,n}, \sigma_{R,n},\ n = 0, \dots, N-1\}$ .

Given these parameters, one may compute explicitly the likelihood of any configuration of coef- 
ficients. Using "routine" HMC techniques, it is also possible to obtain explicit formulas for the 
likelihood of any hidden states configuration, conditional to the coefficients. We refer to [21] for 
a detailed account of these aspects. 
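
To make the model concrete, the following Python sketch simulates one frequency bin: a two-state chain with persistence probabilities $\pi_n$, $\pi'_n$, started at equilibrium, emitting centered Gaussian coefficients. The function name `simulate_bin` and all numerical values are hypothetical choices made for this example.

```python
import numpy as np

def simulate_bin(K, pi_T, pi_R, sigma_T, sigma_R, rng):
    """Simulate one frequency bin: hidden states (True = T) and emitted coefficients."""
    nu_eq = (1.0 - pi_R) / (2.0 - pi_T - pi_R)   # equilibrium frequency of state T
    states = np.empty(K, dtype=bool)
    states[0] = rng.random() < nu_eq              # start the chain at equilibrium
    u = rng.random(K)
    for k in range(1, K):
        stay = pi_T if states[k - 1] else pi_R    # persistence probability of current state
        states[k] = states[k - 1] if u[k] < stay else not states[k - 1]
    sigma = np.where(states, sigma_T, sigma_R)    # state-dependent standard deviation
    return states, sigma * rng.standard_normal(K)

# Illustrative parameters (assumed): persistent residual state, rarer tonal state.
rng = np.random.default_rng(0)
states, y = simulate_bin(50_000, pi_T=0.9, pi_R=0.99, sigma_T=1.0, sigma_R=0.05, rng=rng)
print("empirical T proportion:", states.mean())
print("equilibrium frequency :", (1 - 0.99) / (2 - 0.9 - 0.99))
```

Over a long time frame the empirical proportion of T states should be close to the equilibrium frequency, and T type coefficients have a visibly larger spread than R type ones.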

Remark 1. Notice that in this version of the model, the transition matrix P is assumed to
be frequency independent. More general models involving frequency dependent P matrices (or
further generalizations) may be constructed, without much modification of the overall approach.

Given a signal model as above, we may define the tonal layer of such a signal. 

Definition 1. Let $x$ be a signal modeled as a hidden Markov chain MDCT as above, and let

(12) $\Delta = \{(k,n) \,|\, X_{kn} = T\}$ .

$\Delta$ is called the tonal significance map of $x$. Then the tonal and non tonal layers are given by

(13) $x_{ton} = \sum_{\delta\in\Delta} \beta_\delta w_\delta$ ,

(14) $x_{nton} = x - x_{ton}$ .

This definition makes it possible to obtain simple estimates for quantities of interest, such
as the energy of a tonal signal, or the number of MDCT coefficients needed to encode it.
For example, considering a time frame of $K$ consecutive windows (starting from $k = 0$ for
simplicity^4), and a frequency domain $\{0, \dots, N-1\}$, we set

$\Delta^{(K,N)} = \Delta \cap \left(\{0, \dots, K-1\} \times \{0, \dots, N-1\}\right)$

and we denote by

(15) $N_n^{(K)} = \left|\Delta^{(K,\{n\})}\right|$ ,

(16) $\Gamma_n^{(K)} = \frac{1}{K}\, E\{N_n^{(K)}\}$

the random variables describing respectively the number and the expected proportion of T type
coefficients in the frequency bin $n$, within a time frame of $K$ consecutive windows.

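Proposition 1. With the notations of Definition 1, the average proportion of T type coefficients
within the time frame $\{0, \dots, K-1\}$ in the frequency bin $n$ is given by

(17) $\Gamma_n^{(K)} = \nu_n^{(e)} + \left(\nu_n - \nu_n^{(e)}\right) \frac{1-(\pi_n+\pi'_n-1)^K}{K\,(2-\pi_n-\pi'_n)}$ , where $\nu_n^{(e)} = \frac{1-\pi'_n}{2-\pi_n-\pi'_n}$ .
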

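Proof: From classical properties of HMC, we have that

$\begin{pmatrix} P\{X_{kn} = T\} \\ P\{X_{kn} = R\} \end{pmatrix} = \left(P_n^{t}\right)^k \begin{pmatrix} \nu_n \\ 1-\nu_n \end{pmatrix}$ ,

the superscript "t" denoting matrix transposition. After some algebra, we obtain the following
expressions:

$P\{X_{kn} = T\} = \frac{\left((1-\pi_n)\nu_n - (1-\pi'_n)(1-\nu_n)\right)(\pi_n+\pi'_n-1)^k + (1-\pi'_n)}{2-\pi_n-\pi'_n}$

$\qquad = \nu_n\,(\pi_n+\pi'_n-1)^k + \frac{1-\pi'_n}{2-\pi_n-\pi'_n}\left(1-(\pi_n+\pi'_n-1)^k\right)$ .

Similarly, we obtain for $P\{X_{kn} = R\} = 1 - P\{X_{kn} = T\}$

$P\{X_{kn} = R\} = (1-\nu_n)\,(\pi_n+\pi'_n-1)^k + \frac{1-\pi_n}{2-\pi_n-\pi'_n}\left(1-(\pi_n+\pi'_n-1)^k\right)$ .
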

^4 In fact, this choice of origin matters only if the initial frequency $\nu$ of the chain is not
assumed to equal the equilibrium frequency $\nu^{(e)}$, which will not be the case in the
situations we consider.
Finally, the result is obtained by replacing $P\{X_{kn} = T\}$ with its expression in

$E\{N_n^{(K)}\} = \sum_{k=0}^{K-1} P\{X_{kn} = T\}$ ,

which yields the desired expression. □

Notice that in the limit of large time frames, one obtains the simpler estimate

$\lim_{K\to\infty} \Gamma_n^{(K)} = \frac{1-\pi'_n}{2-\pi_n-\pi'_n} = \nu_n^{(e)}$ ,

which of course does not depend any more on $K$.
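
The average proportion of T type coefficients can be cross-checked numerically. The sketch below compares a closed form, obtained by summing the geometric series in $P\{X_{kn} = T\}$ over $k = 0, \dots, K-1$, with a direct iteration of the chain distribution. The function names and parameter values are assumptions made for this example.

```python
import numpy as np

def tonal_proportion_closed_form(K, pi_T, pi_R, nu):
    """Average of P{X_k = T} over k = 0..K-1, via the geometric series in (pi_T + pi_R - 1)."""
    lam = pi_T + pi_R - 1.0
    nu_eq = (1.0 - pi_R) / (2.0 - pi_T - pi_R)
    return nu_eq + (nu - nu_eq) * (1.0 - lam ** K) / (K * (1.0 - lam))

def tonal_proportion_by_iteration(K, pi_T, pi_R, nu):
    """Same quantity, obtained by propagating the state distribution with the transposed matrix."""
    P = np.array([[pi_T, 1.0 - pi_T], [1.0 - pi_R, pi_R]])
    dist = np.array([nu, 1.0 - nu])     # (P{X_0 = T}, P{X_0 = R})
    total = 0.0
    for _ in range(K):
        total += dist[0]
        dist = P.T @ dist               # one step of the chain
    return total / K

closed = tonal_proportion_closed_form(10, 0.9, 0.95, 0.5)
iterated = tonal_proportion_by_iteration(10, 0.9, 0.95, 0.5)
print(closed, iterated)
```

For large $K$ both values approach the equilibrium frequency $(1-\pi'_n)/(2-\pi_n-\pi'_n)$, whatever the initial frequency.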

The energy of the tonal layer is also completely characterized by the parameters of the model, 
and has a simple behavior. 



Proposition 2. With the same notations as before, conditional to the parameters of the model,
we have

$\frac{1}{K}\, E\left\{\sum_{\delta\in\Delta^{(K,N)}} |\beta_\delta|^2\right\} = \sum_{n=0}^{N-1} \Gamma_n^{(K)}\, \sigma_{T,n}^2$ ,

where $\Gamma_n^{(K)}$ is the average proportion of T type coefficients given in Proposition 1.

Proof: the result follows from the fact that conditional to the hidden states, the considered
random variables at fixed frequency are i.i.d. $\mathcal N(0, \sigma_{T,n}^2)$ random variables.
It is then enough to plug the expression of $\Gamma_n^{(K)}$ obtained above in the $L^2$ norm of
the tonal layer. □

Again, the latter expression simplifies in the limit $K\to\infty$, or if the initial frequencies
of the chains $X_{\cdot n}$ are assumed to equal the equilibrium frequencies. In that situation,
we obtain

(18) $\lim_{K\to\infty} \frac{1}{K}\, E\left\{\sum_{\delta\in\Delta^{(K,N)}} |\beta_\delta|^2\right\} = \sum_{n=0}^{N-1} \nu_n^{(e)}\, \sigma_{T,n}^2 = \sum_{n=0}^{N-1} \frac{1-\pi'_n}{2-\pi_n-\pi'_n}\, \sigma_{T,n}^2$ .




Remark 2. Thanks to the simplicity of the Gaussian model, similar estimates may be obtained
for other $\ell^p$-type norms.
A fundamental aspect of transform coding schemes based on non-linear approximations such
as the one we are describing here is the fact that the significance maps $\Delta$ have to be
encoded together with the corresponding coefficients. Since the significance map takes the form
of a series of segments of Ts and segments of Rs with various lengths, it is natural to use
classical techniques of run length coding (see for example [15], Chapter 10, for a detailed
account) to encode them. The corresponding bit rate depends crucially on the entropy of the
distribution of T and R segments. For the sake of simplicity, let us introduce the entropy of a
binary source with probabilities $(p, 1-p)$:

(20) $h(p) = -p\log_2(p) - (1-p)\log_2(1-p)$ .

Proposition 3. Assume that the initial frequencies of the chains $X_{\cdot n}$ equal their
equilibrium frequencies. For each frequency bin $n$, the entropy of the distribution of lengths
$L_n$ of T and R segments reads

(21) $H(L_n) = \frac{1-\pi'_n}{2-\pi_n-\pi'_n}\, h(\pi_n) + \frac{1-\pi_n}{2-\pi_n-\pi'_n}\, h(\pi'_n)$ .

Proof: Denote by $L_T$ and $L_R$ the lengths of T and R segments. From the Markov model $X$ it
follows that $L_T$ and $L_R$ are exponentially distributed:

$P\{L_T = \ell\} = \pi_n^{\ell-1}(1-\pi_n)$ , $P\{L_R = \ell\} = {\pi'_n}^{\ell-1}(1-\pi'_n)$ , $\ell = 1, 2, \dots$

A simple calculation shows that the Shannon entropy of the random variable $L_T$ is given by

$-\sum_{\ell=1}^{\infty} P\{L_T = \ell\}\log_2\left(P\{L_T = \ell\}\right) = -\pi_n\log_2(\pi_n) - (1-\pi_n)\log_2(1-\pi_n) = h(\pi_n)$ ,

and a similar expression for the Shannon entropy of $L_R$. Now, because of the assumption on
the initial frequencies of the chains $X_{\cdot n}$, and dropping the indices for the sake of
simplicity, we have that

$P\{X = T\} = \frac{1-\pi'}{2-\pi-\pi'}$ ,

and the equality

$H(L) = P\{X = T\}\, H(L_T) + P\{X = R\}\, H(L_R)$

yields the desired result. □

Finally, let us briefly discuss questions regarding the quantization of coefficients. The
simplicity of the model (Gaussian coefficients, and Markov chain significance map) makes it
possible to obtain elementary rate-distortion estimates. Indeed, the optimal rate-distortion
function for Gaussian random variables is well known: for a $\mathcal N(0, \sigma^2)$ random
variable,

(22) $D(R) = \sigma^2\, 2^{-2R}$ .

Let us assume that the T type coefficients at frequency $n$ are quantized using $R_n$ bits per
coefficient. Using the optimal rate-distortion function (22), the overall distortion per window,
averaged over a time frame of $K$ windows, is given by

$D = \frac{1}{K}\sum_{n=0}^{N-1} N_n^{(K)}\, \sigma_{T,n}^2\, 2^{-2R_n}$ .

If we are given a global budget of $R$ bits per sample, the optimal bit rate distribution over
frequency bins is obtained by minimizing $E\{D\}$ with respect to $R_n$, under the "global bit
budget" constraint

$\frac{1}{K}\, E\left\{\sum_{n=0}^{N-1} N_n^{(K)} R_n\right\} = NR$ ,


the expectation being taken with respect to the significance map $\Delta$. Assuming for the sake
of simplicity that the Markov chain is at equilibrium (i.e. $\nu_n = \nu_n^{(e)}$ for all $n$),
this yields the following simple expression

(23) $R_n = \frac{N}{\bar N}\, R + \frac{1}{2}\log_2\left(\sigma_{T,n}^2\right) - \frac{1}{2\bar N}\sum_{m=0}^{N-1} \nu_m^{(e)} \log_2\left(\sigma_{T,m}^2\right)$ ,

where we have denoted by

$\bar N = \sum_{n=0}^{N-1} \nu_n^{(e)}$

the average number of T type coefficients per window. As usual in this type of calculation,
the so-obtained optimal value of $R_n$ is generally not an integer number, and an additional
rounding operation is needed in practice. The distortion obtained with the rounded bit rates is
therefore larger than the bound obtained with the values above. Summarizing this calculation,
and plugging these optimal bit rates into the expression of the distortion, we obtain

Proposition 4. With the above notations, the following rate-distortion bound holds: for a given
overall bit budget of $R$ bits per sample,

(24) $E\{D\} \ge \bar N \left(\prod_{n=0}^{N-1} \sigma_{T,n}^{2\nu_n^{(e)}}\right)^{1/\bar N} 2^{-2NR/\bar N}$ .
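
The bit allocation (23) and the bound (24) can be verified numerically. The following Python sketch assumes the chain is at equilibrium, so that the expected distortion per window is $\sum_n \nu_n^{(e)} \sigma_{T,n}^2 2^{-2R_n}$ and the budget constraint reads $\sum_n \nu_n^{(e)} R_n = NR$. Function names and numerical values are hypothetical, and no rounding or non-negativity constraint on the rates is applied.

```python
import numpy as np

def allocate_bits(R, nu_eq, sigma2_T):
    """Unrounded per-bin rates R_n following (23), for a budget of R bits per sample."""
    nu_eq, sigma2_T = np.asarray(nu_eq, float), np.asarray(sigma2_T, float)
    N, N_bar = nu_eq.size, nu_eq.sum()
    log_var = np.log2(sigma2_T)
    return (N / N_bar) * R + 0.5 * log_var - np.sum(nu_eq * log_var) / (2.0 * N_bar)

def expected_distortion(R_n, nu_eq, sigma2_T):
    """Expected distortion per window at equilibrium, using D(R) = sigma^2 2^(-2R)."""
    return float(np.sum(np.asarray(nu_eq) * np.asarray(sigma2_T) * 2.0 ** (-2.0 * R_n)))

def distortion_bound(R, nu_eq, sigma2_T):
    """Right-hand side of (24)."""
    nu_eq, sigma2_T = np.asarray(nu_eq, float), np.asarray(sigma2_T, float)
    N, N_bar = nu_eq.size, nu_eq.sum()
    weighted_geo_mean = np.exp(np.sum(nu_eq * np.log(sigma2_T)) / N_bar)
    return float(N_bar * weighted_geo_mean * 2.0 ** (-2.0 * N * R / N_bar))

# Illustrative values (assumed): three frequency bins, 1 bit per sample.
nu_eq = np.array([0.2, 0.1, 0.05])
sigma2_T = np.array([4.0, 1.0, 0.25])
R = 1.0
R_n = allocate_bits(R, nu_eq, sigma2_T)
print("rates:", R_n)
print("distortion:", expected_distortion(R_n, nu_eq, sigma2_T))
print("bound     :", distortion_bound(R, nu_eq, sigma2_T))
```

With the unrounded rates the expected distortion equals the bound exactly; rounding the rates to integers can only increase it, as noted in the text.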

2.2. Parameter and state estimation: algorithmic aspects. Hidden Markov models have
been very successful because there exist naturally associated efficient algorithms for both
parameter estimation and hidden state estimation, respectively the Expectation-Maximization (EM)
and Viterbi algorithms. However, while these are natural answers to the estimation problems
in general situations, they are not so natural anymore in a coding setting, as we explain below.
From a general point of view, an input signal is first expanded with respect to an MDCT
basis, corresponding to a fixed time segmentation (segments of approximately 20 msec.) Then,
within larger time frames, the parameters are (re)estimated, as well as the hidden states.
Parameters are refreshed on a regular basis.



2.2.1. Parameter estimation. Given the parameter set $\theta$ of the model, the forward-backward
equations allow one to obtain estimates for the probabilities of hidden states conditional to the
observations:

(25) $p_{kn}(T) = P\{X_{kn} = T \,|\, \theta, Y_{1:K,n} = y_{1:K,n}\}$ ,

(26) $p_{kn}(R) = P\{X_{kn} = R \,|\, \theta, Y_{1:K,n} = y_{1:K,n}\}$

and the likelihood of the parameters, from which new estimates for the parameter set $\theta$ may
be derived.

Remark 3. From a practical point of view, such parameter re-estimation happens to be quite 
costly. Therefore, the parameters are generally re-estimated on a larger time scale, taking several 
consecutive windows into account. 

Remark 4. For practical purposes, it is generally more suitable to restrict the parameter set
$\theta$ to a smaller subset. The following two assumptions proved to be quite adapted to the
case of audio signals:

i. The variances may be assumed to be multiples of a single reference value, implementing
some "natural" decay of MDCT coefficients with respect to frequency. For example, we
generally used expressions involving a reference frequency bin $n_0\in\mathbb R^+$, a reference
standard deviation $\sigma_s$ for state $s$, and a constant $\alpha\in\mathbb R^+$ controlling the
decay (typical value being $\alpha = 1$). Without such an assumption, frequency bins are
completely independent of each other, and the estimation algorithm generally yields T type
coefficients in all bins, which is not realistic.
ii. For each frequency bin, the initial frequencies $\nu_n$ of the considered Markov chain are
generally assumed to equal the equilibrium frequencies $\nu_n^{(e)}$.

2.2.2. State estimation. Viterbi's algorithm is generally considered the natural answer to the
state estimation problem. It is a dynamic programming algorithm, which yields Maximum a
posteriori (MAP) estimates

$\hat x_{1:K,n} = \arg\max_{x_{1:K,n}} P\{X_{1:K,n} = x_{1:K,n} \,|\, y_{1:K,n}, \theta\}$ ,

for each frequency bin $n$. However, the number of so-obtained coefficients in a given state (T or
R) cannot be controlled a priori when such an algorithm is used, which turns out to be a severe
limitation in a signal coding perspective. In addition, Viterbi's algorithm requires that accurate
estimates of the model's parameters are available, which will not necessarily be the case if the
parameter estimates are refreshed on a coarse time scale (see above.)
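
For reference, MAP state estimation in a single frequency bin may be sketched as follows. This is a generic log-domain Viterbi recursion for a two-state chain with centered Gaussian emissions, written for illustration; the function names and test values are assumptions, and all probabilities are assumed to lie strictly between 0 and 1.

```python
import numpy as np

def gaussian_logpdf(y, sigma):
    """Log-density of a centered Gaussian with standard deviation sigma."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * (y / sigma) ** 2

def viterbi_two_state(y, pi_T, pi_R, nu, sigma_T, sigma_R):
    """MAP hidden state sequence for one bin; returns a boolean array, True where state is T."""
    y = np.asarray(y, dtype=float)
    K = y.size
    log_P = np.log(np.array([[pi_T, 1.0 - pi_T], [1.0 - pi_R, pi_R]]))   # rows: from T, from R
    log_emit = np.stack([gaussian_logpdf(y, sigma_T), gaussian_logpdf(y, sigma_R)], axis=1)
    delta = np.log(np.array([nu, 1.0 - nu])) + log_emit[0]
    back = np.zeros((K, 2), dtype=int)
    for k in range(1, K):
        cand = delta[:, None] + log_P            # cand[i, j]: best score in state i, then i -> j
        back[k] = np.argmax(cand, axis=0)        # best predecessor of each state j
        delta = cand[back[k], [0, 1]] + log_emit[k]
    path = np.empty(K, dtype=int)
    path[-1] = int(np.argmax(delta))
    for k in range(K - 1, 0, -1):                # backtracking
        path[k - 1] = back[k, path[k]]
    return path == 0                             # index 0 encodes the T state

y = np.array([0.01, 0.02, 3.0, -2.5, 2.8, 0.0, 0.01])
tonal = viterbi_two_state(y, pi_T=0.9, pi_R=0.9, nu=0.5, sigma_T=2.0, sigma_R=0.1)
print(tonal)
```

As the text points out, the number of coefficients labelled T by such a procedure is an output of the algorithm and cannot be prescribed in advance.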

Therefore, we also consider, as an alternative to Viterbi's algorithm, an a posteriori
probabilities thresholding method, which is computationally far simpler, and allows a fine rate
control. More precisely, given a prescribed rate $N_{ton}$,

i. Sort the MDCT coefficients $y_{kn} = \langle x, w_{kn}\rangle$ in order of decreasing a
posteriori probability $p_{kn}(T)$ in (25),
ii. Keep the $N_{ton}$ first sorted coefficients.

In this way, for an average bit rate $R$ and a prescribed "tonal" bit budget, a number $N_{ton}$
of MDCT coefficients to be retained may be estimated, and the $N_{ton}$ coefficients with largest
a posteriori probability are selected.
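
The two steps above can be sketched in Python as follows. The posterior probabilities are computed with a standard scaled forward-backward recursion for a two-state chain with centered Gaussian emissions, and the $N_{ton}$ positions with largest posterior are kept. Function names, array layout (time along rows, frequency along columns) and test values are assumptions made for this example; the sketch also assumes that at least one state has non-vanishing emission density at each time.

```python
import numpy as np

def posterior_tonal(y, pi_T, pi_R, nu, sigma_T, sigma_R):
    """P{X_k = T | y_1..y_K} for one frequency bin, by scaled forward-backward recursions."""
    y = np.asarray(y, dtype=float)
    K = y.size
    P = np.array([[pi_T, 1.0 - pi_T], [1.0 - pi_R, pi_R]])
    sig = np.array([sigma_T, sigma_R])
    emit = np.exp(-0.5 * (y[:, None] / sig) ** 2) / (np.sqrt(2.0 * np.pi) * sig)  # shape (K, 2)
    alpha = np.zeros((K, 2))
    beta = np.ones((K, 2))
    a = np.array([nu, 1.0 - nu]) * emit[0]
    alpha[0] = a / a.sum()
    for k in range(1, K):                      # forward pass, renormalized at each step
        a = (alpha[k - 1] @ P) * emit[k]
        alpha[k] = a / a.sum()
    for k in range(K - 2, -1, -1):             # backward pass, renormalized at each step
        b = P @ (emit[k + 1] * beta[k + 1])
        beta[k] = b / b.sum()
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)    # scaling constants cancel here
    return post[:, 0]

def select_tonal(post, n_ton):
    """Boolean mask of the n_ton positions with largest posterior; post has shape (K, N)."""
    top = np.argsort(post, axis=None)[::-1][:n_ton]
    mask = np.zeros(post.shape, dtype=bool)
    mask[np.unravel_index(top, post.shape)] = True
    return mask

y = np.array([0.01, 0.02, 3.0, -2.5, 2.8, 0.0, 0.01])
post = posterior_tonal(y, pi_T=0.9, pi_R=0.9, nu=0.5, sigma_T=2.0, sigma_R=0.1)
mask = select_tonal(post[:, None], n_ton=3)    # single bin, reshaped to (K, 1)
print(np.round(post, 3), mask[:, 0])
```

Unlike the MAP estimate, the number of retained coefficients is here fixed by `n_ton`, which is what permits rate control.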

2.3. Numerical simulations. As a first test of the model and the estimation algorithms, we 
generated realizations of the structured harmonic mixture of Gaussians model described above, 
and used the corresponding estimation algorithms. We simulated a signal according to the
"tonal + residual" Markov model as above, with about 3.1% T-type coefficients. We show in
Figure 2 the result of the estimation of the tonal layer using EM parameter estimation, and
state estimation via the Viterbi algorithm. As may be seen, the significance map is fairly well
estimated, except in regions where the signal has little energy, which was to be expected. In
these regions, the algorithm detects spurious (vertical) tonal structures, which results in an
increase of the percentage of T type coefficients (about 4.1% instead of 3.2% for that example.)
However, since this effect appears only in regions where the signal has small energy, it does not
greatly affect the estimated signal, which is very close to the simulated one (not shown
here.)
Figure 2. Estimating a tonal layer from simulated signal; from top to bottom:
simulated significance map, estimated significance map (estimation via the Viterbi
algorithm), estimated tonal signal, estimated residual signal.



For the sake of comparison, we display in Figures 3 and 4 some examples of tonal layer
estimation using the thresholding algorithm instead of the Viterbi algorithm, for various values
of the threshold. The simulation presented in Figure 3 corresponds to 1% retained coefficients,
while the simulation presented in Figure 4 corresponds to 3% retained coefficients. As expected,
the significance map in Figure 3 appears much terser than the "true" one, while the one in
Figure 4 is much closer (a percentage of retained coefficients significantly larger than the true
one yields spurious tonal structures.) With 1% retained coefficients, some tonal components are
not correctly captured, and appear in the residual signal of Figure 3. This is not the case any
more when the threshold is set to a more "realistic" value, as may be seen in the tonal and
residual layers of Figure 4. In that case, the residual only features a small spurious component.
Notice that
Figure 3. Estimating a tonal layer from simulated signal (same as in Figure 2);
from top to bottom: estimated significance map (estimated via the posterior probability
thresholding algorithm, using 1% coefficients); estimated tonal signal, estimated
residual signal.

even though significantly fewer coefficients are retained, the overall shape of the estimated
signal is quite good.

Remark 5. Clearly, the posterior probability thresholding method only provides an approxi- 
mation of the "true" tonal layer (which is provided by the Viterbi algorithm), whose precision 
depends on the choice of the threshold, i.e. the bit rate allocated to the tonal layer. Controlling 
the relation between the bit rate and the precision of the approximation would lead to a rate 
distortion theory for the "functional" part of the tonal coder. Such a theory seems extremely 
difficult to develop, and so far we could only study it by numerical simulations (not shown here.) 



3. Structured Markov model for transient 

3.1. Hidden wavelet Markov tree model. We now turn to the description of the transient 
model, which was partly presented in [21]. The latter exploits the fact that wavelet bases are 
"well adapted" for describing transients, in the sense that these generally yield scale-persistent 
Figure 4. Estimating a tonal layer from simulated signal (same as in Figure 2);
from top to bottom: estimated significance map (estimated via the posterior
probability thresholding algorithm, using 3% coefficients), estimated tonal signal,
estimated residual signal.



chains of significant wavelet coefficients. We start from a multiresolution analysis (see for
example [IZIEH]) and the corresponding wavelet $\psi \in L^2(\mathbb{R})$, scaling function
$\varphi \in L^2(\mathbb{R})$ and the associated wavelet basis $\{\psi_{jk}\}$, obtained from $\psi$ by
dyadic dilations and integer translations.

Given $x \in L^2(\mathbb{R})$, its wavelet coefficients $d_{jk} = \langle x, \psi_{jk}\rangle$ are naturally labelled by a dyadic tree,
as in Figure 5, in which it clearly appears that a given wavelet coefficient $d_{jk}$ may be given a
pair of children $d_{j+1,2k}$ and $d_{j+1,2k+1}$. For the sake of simplicity, we shall sometimes collect the
two indices $j, k$ into the scale-time index $\lambda = (j, k)$.

For the sake of simplicity, we consider a fixed time interval, and a signal model involving
finitely many scales, of the form

(27)    $x = S_{j_0}\varphi_{j_0} + \sum_{j=1}^{J} \sum_{k=0}^{2^{j-1}-1} D_{jk}\psi_{jk}$
Figure 5. Wavelet coefficients tree.



involving $N(J) = 2^J - 1$ random wavelet coefficients, whose distribution is a Gaussian mixture
governed by a hidden random variable.

More precisely, the distribution of the wavelet coefficients $D_{jk}$ depends on a hidden state
$X_{jk} \in \{T, R\}$ (T stands for "transient", and R for "residual"). At each scale $j$, the T-type coefficients
are modelled by a centered normal distribution with (large) variance $\sigma_{T,j}^2$. The R-type coefficients
are modelled by a centered normal distribution with (small) variance $\sigma_{R,j}^2$.

The distribution of hidden states is given by a "coarse to fine" Markov chain, characterized by
a 2 x 2 transition matrix, and the distribution of the coarsest scale state. In order to retain only
connected trees, we impose a taboo transition: the transition $R \to T$ is forbidden. Therefore,
the transition matrix assumes the form

$P_j = \begin{pmatrix} \pi_j & 1 - \pi_j \\ 0 & 1 \end{pmatrix}$ ,

(rows and columns being ordered as $T$, $R$), where $\pi_j$ denotes the scale persistence probability,
namely the probability of the transition $T \to T$ at scale $j$.




The scaling function coefficients $S_{j_0}$ are generally irrelevant for audio signals, and do not deserve much modelling
effort.
The hidden Markov process is completely determined by the set of matrices $P_j$ and the "initial"
probability distribution, namely the probabilities $\nu = (\nu_T, \nu_R)$ of states at the coarsest
considered scale. The complete model is therefore characterized by the numbers $\pi_j$, $\nu$, and
the emission probability densities:

$\rho_S(d) = \rho(d \mid X = S)$ ,    $S = T, R$.

In the sequel, we shall always assume that the persistence probabilities are scale independent, and
write $\pi_j = \pi$.

According to our choice (centered Gaussian distributions), the emission densities are completely
characterized by their variances $\sigma_{T,j}^2$ and $\sigma_{R,j}^2$. Altogether, the model is completely specified by the
parameter set

(28)    $\theta = \{\nu, \pi, \sigma_{T,j}, \sigma_{R,j},\ j = 1, \dots, J\}$ ,

which leads to the definition of the transient significance map (termed transient feature in [21]).

Definition 2. Let the parameter set in (28) be fixed, and let $x$ denote a signal given by a hidden
Markov tree model as in (27) above. Consider the random set

(29)    $\Lambda = \{(j,k),\ j = 1, \dots, J,\ k = 0, \dots, 2^{j-1}-1 \mid X_{jk} = T\}$ .

$\Lambda$ is called the transient significance map of $x$. The corresponding transient layer of $x$ is defined
as

(30)    $x_{tr} = \sum_{(j,k) \in \Lambda} D_{jk}\psi_{jk}$ .
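The generative model behind Definition 2 can be simulated directly. The following is a minimal sketch, assuming scale-independent persistence probability `pi`, a root probability `nu` of being in state T, and per-scale standard deviations `sigma_T`, `sigma_R`; the function name and the dictionary layout are illustrative choices, not part of the original scheme.

```python
import random

def sample_transient_tree(J, nu, pi, sigma_T, sigma_R, rng):
    """Sample hidden states and coefficients on a dyadic tree with J scales.

    Scale j = 1 is the coarsest (a single root node); scale j holds 2**(j-1)
    nodes, and node (j, k) has children (j+1, 2k) and (j+1, 2k+1).
    sigma_T[j-1] and sigma_R[j-1] are standard deviations at scale j.
    """
    states, coeffs = {}, {}
    for j in range(1, J + 1):
        for k in range(2 ** (j - 1)):
            if j == 1:
                is_T = rng.random() < nu              # initial distribution at the root
            else:
                parent_is_T = states[(j - 1, k // 2)] == "T"
                # taboo transition: a node with an R parent is never T
                is_T = parent_is_T and rng.random() < pi
            states[(j, k)] = "T" if is_T else "R"
            sd = sigma_T[j - 1] if is_T else sigma_R[j - 1]
            coeffs[(j, k)] = rng.gauss(0.0, sd)       # centered Gaussian emission
    Lambda = {node for node, s in states.items() if s == "T"}
    return states, coeffs, Lambda

rng = random.Random(0)
J = 6
states, coeffs, Lambda = sample_transient_tree(
    J, nu=0.9, pi=0.7, sigma_T=[1.0] * J, sigma_R=[0.05] * J, rng=rng
)
# Coefficients of the transient layer: those indexed by the significance map.
transient_coeffs = {node: coeffs[node] for node in Lambda}
```

Because of the taboo transition, every T node other than the root has a T parent, so the sampled significance map is always a connected subtree.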



This is actually quite a strong assumption, which has the advantage of reducing the number of parameters 
to estimate. Alternative choices can also be considered, for example controlling the growth of the number of 
significant coefficients across scales by setting ni = ttq for some constant ttq. 
From this definition, one may easily derive estimates on various coding rates. The key point
is the following immediate remark. Let $N_j$ denote the number of T-type coefficients at scale $j$,
and let

$N = \sum_{j=1}^{J} N_j$

be the total number of T-type coefficients. The following result is fairly classical in
branching processes theory (see for example [10, 16]).

Proposition 5. Let $x$ denote a signal given by a hidden Markov tree model as in (27) above.
Then the number $N$ of T-type coefficients is given by a Galton-Watson process. In particular,
one has

(31)    $\mathbb{E}\{N_j\} = \nu (2\pi)^{j-1}$ ,    $\bar N := \mathbb{E}\{N\} = \nu\, \dfrac{(2\pi)^J - 1}{2\pi - 1}$

(with the obvious modification for the case $\pi = 1/2$).
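The expected count in (31) is easy to check by simulating the branching process. The sketch below assumes that each T node has two children, each of which is T with probability `pi` independently, and that the root is T with probability `nu`; the function names are illustrative.

```python
import random

def expected_total(J, nu, pi):
    """Closed form for E{N}; the case pi = 1/2 gives nu * J."""
    if abs(2 * pi - 1) < 1e-12:
        return nu * J
    return nu * ((2 * pi) ** J - 1) / (2 * pi - 1)

def simulate_total(J, nu, pi, rng):
    """One draw of N, the total number of T-type nodes over J scales."""
    n_T = 1 if rng.random() < nu else 0
    total = n_T
    for _ in range(2, J + 1):
        # each of the 2 * n_T children of T nodes is T with probability pi
        n_T = sum(1 for _ in range(2 * n_T) if rng.random() < pi)
        total += n_T
    return total

rng = random.Random(1)
J, nu, pi = 5, 0.8, 0.6
trials = 20000
empirical = sum(simulate_total(J, nu, pi, rng) for _ in range(trials)) / trials
print(empirical, expected_total(J, nu, pi))
```

With these values the closed form gives about 5.95, and the Monte Carlo average should agree up to sampling error.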

Therefore, it is straightforward to obtain estimates for the energy of a transient layer:
Corollary 1. The average energy of the transient layer of a signal $x$ reads

(32)    $\mathbb{E}\Big\{ \sum_{j,k;\, X_{jk}=T} |D_{jk}|^2 \Big\} = \nu \sum_{j=1}^{J} \sigma_{T,j}^2\, (2\pi)^{j-1}$ .

Another simple consequence is the following a priori estimate for the cost of significance map
encoding. It is known that it is possible to encode a binary tree at a cost which is linear in the
number of nodes. We use the following strategy for encoding the tree $\Lambda$ (even though it is not
optimal, it has the advantage of being simple; improvements may be obtained by using entropy
coding techniques, taking advantage of the probability distribution of trees, which is known as
soon as the persistence probability $\pi$ is known). We associate with each node of $\Lambda$ a pair of bits,
set to 0 or 1 depending on whether the left and right children of the node belong to $\Lambda$ or not.
Therefore, the number of bits $R_{sm}$ is not larger than twice the number of nodes of $\Lambda$, i.e. the
number of T-type coefficients, and we immediately deduce
Corollary 2. Given the set of parameters $\theta$, and the corresponding hidden Markov wavelet tree
model, let $R_{sm}$ denote the number of bits necessary to encode the significance map of a transient
wavelet coefficients tree, as above. Then we have

$\mathbb{E}\{R_{sm}\} \le \begin{cases} 2\nu\, \dfrac{1 - (2\pi)^J}{1 - 2\pi} & \text{if } \pi \neq 0.5, \\ 2\nu J & \text{if } \pi = 0.5. \end{cases}$
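The two-bits-per-node strategy can be sketched as follows. This is only an illustration under stated assumptions: the tree is non-empty and rooted at `(1, 0)`, nodes are visited breadth-first, and no bits are emitted for nodes at the finest scale (which have no children), so the cost stays below twice the number of nodes. The function names are hypothetical.

```python
def encode_tree(Lambda, J):
    """Breadth-first encoding: two bits (children flags) per visited node."""
    bits = []
    queue = [(1, 0)]                  # assumes the root (1, 0) belongs to Lambda
    while queue:
        j, k = queue.pop(0)
        if j == J:
            continue                  # nodes at the finest scale have no children
        for child in ((j + 1, 2 * k), (j + 1, 2 * k + 1)):
            present = child in Lambda
            bits.append(1 if present else 0)
            if present:
                queue.append(child)
    return bits

def decode_tree(bits, J):
    """Inverse of encode_tree: rebuild the significance map from the bits."""
    Lambda = {(1, 0)}
    queue = [(1, 0)]
    pos = 0
    while queue:
        j, k = queue.pop(0)
        if j == J:
            continue
        for child in ((j + 1, 2 * k), (j + 1, 2 * k + 1)):
            if bits[pos]:
                Lambda.add(child)
                queue.append(child)
            pos += 1
    return Lambda

Lambda = {(1, 0), (2, 0), (2, 1), (3, 1)}
bits = encode_tree(Lambda, J=3)
print(bits)                           # [1, 1, 0, 1, 0, 0]
```

Here four nodes cost six bits, within the bound of two bits per node, and `decode_tree(bits, 3)` recovers the original set.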

The simplicity of the transient model (i.e. Galton-Watson significance map, and Gaussian
T coefficients) makes it possible to derive simple rate-distortion estimates, along lines similar
to the ones we followed for the tonal layer. Assume that the T-type coefficients at scale $j$ are
quantized using $R_j$ bits. Assuming (22), the overall distortion is given by

$\mathbb{E}\{D\} = \sum_{j=1}^{J} \mathbb{E}\{N_j\}\, \sigma_{T,j}^2\, 2^{-2R_j}$ .

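Suppose we are given a global budget of $R$ bits per sample. Minimizing $\mathbb{E}\{D\}$ with respect to
$R_j$, under the "global bit budget" constraint

$\mathbb{E}\Big\{ \sum_{j=1}^{J} N_j R_j \Big\} = N(J)\, R$

yields the following simple expression

(33)    $R_j = \dfrac{N(J)}{\bar N}\, R + \dfrac{1}{2}\log_2(\sigma_{T,j}^2) - \dfrac{1}{2\bar N} \sum_{i=1}^{J} \nu\, (2\pi)^{i-1} \log_2(\sigma_{T,i}^2)$ .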

Therefore, plugging this expression into the optimal rate-distortion function (22), we obtain the
following rate-distortion estimate

Proposition 6. With the same notations as before, we have the following estimate: for a given
overall bit budget of $R$ bits per T type coefficient, the distortion is such that

(34)    $\mathbb{E}\{D\} \ge \bar N \Big( \prod_{j=1}^{J} \sigma_{T,j}^{2\bar N_j} \Big)^{1/\bar N} 2^{-2 N(J) R / \bar N}$ ,

where we have set

$\bar N_j = \nu (2\pi)^{j-1}$ ,    $\bar N = \nu\, \dfrac{(2\pi)^J - 1}{2\pi - 1}$ .
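The bit allocation across scales can be illustrated numerically. The sketch below rests on explicit assumptions: a per-coefficient distortion of the high-resolution form $\sigma^2 2^{-2R}$, expected counts $\bar N_j = \nu(2\pi)^{j-1}$, and a total budget of $N(J)\,R$ bits with $N(J) = 2^J - 1$; under these assumptions a Lagrangian argument gives rates equal to a common term plus half the log-variance deviation from its weighted mean. Function names and the numerical values are illustrative only.

```python
import math

def allocate_bits(R, J, nu, pi, var_T):
    """Per-scale rates R_j and expected counts, under the assumptions above."""
    n_bar = [nu * (2 * pi) ** (j - 1) for j in range(1, J + 1)]
    n_tot = sum(n_bar)
    budget = (2 ** J - 1) * R            # N(J) * R bits in total
    mean_log_var = sum(n * math.log2(v) for n, v in zip(n_bar, var_T)) / n_tot
    rates = [budget / n_tot + 0.5 * (math.log2(v) - mean_log_var) for v in var_T]
    return rates, n_bar

def expected_distortion(rates, n_bar, var_T):
    """Sum over scales of (expected count) * variance * 2^(-2 R_j)."""
    return sum(n * v * 2 ** (-2 * r) for n, v, r in zip(n_bar, var_T, rates))

J, R = 4, 0.5
var_T = [4.0, 2.0, 1.0, 0.5]
rates, n_bar = allocate_bits(R, J, nu=0.9, pi=0.7, var_T=var_T)
print(rates, expected_distortion(rates, n_bar, var_T))
```

Scales with larger variance receive more bits. Note that this high-resolution formula may return negative rates when some variances are very small, in which case a practical coder would have to clip them and redistribute the budget.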

3.2. Parameters and state estimation. As in the case of the tonal layer, the parameter
estimation and the hidden state estimation may be realized through standard EM and Viterbi type
algorithms. These algorithms are mainly based upon adapted versions of the above mentioned
forward-backward algorithm: the so-called "upward-downward" algorithm, proposed by Crouse
and collaborators. Actually, we rather used a variant, the downward-upward algorithm,
due to Durand and Gonçalvès [11], which provides a better control of the numerical accuracy of the
computations. As a result, the algorithm provides estimates for quantities such as the hidden
states probabilities

$\mathbb{P}\{X_{jk} = s \mid D_{1 \dots 2^J-1} = d_{1 \dots 2^J-1}, \theta\}$

and the likelihood of the observed coefficients given $\theta$.

3.2.1. Parameters estimation. The parameter estimation goes along lines similar to the ones
outlined in Section 2.2.1 (see also [21] for additional details). Again, since the parameter estimation
procedure, involving the upward-downward algorithm, is quite costly, it is done simultaneously on
several consecutive time windows (i.e. several consecutive trees), and parameters are "refreshed"
on larger time scales.

3.2.2. Hidden states estimation. Again, the situation is very similar to the situation encountered 
when dealing with the tonal layer. The "Viterbi-type" algorithm described in [11] theoretically 
provides an estimate for "the" transient significance map, and therefore the transient layer. 
However, it does not allow one to control the number of selected coefficients (the rate), and is 
therefore not appropriate in a context of variable bit rate coder. Hence, we rather turn to the 
(also computationally simpler) alternative, using thresholding of a posteriori probabilities. 

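The upward-downward algorithm provides estimates for the posterior probabilities $P_{jk}(T) = \mathbb{P}\{X_{jk} = T \mid D_{1 \dots 2^J-1} = d_{1 \dots 2^J-1}, \theta\}$.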
Figure 6. Estimating a transient layer from simulated signal; from top to bottom:
simulated signal, simulated transient layer, simulated significance tree,
estimated transient layer (estimation via the Viterbi algorithm), estimated
significance tree, estimated residual signal.



Therefore, the corresponding tree nodes may be sorted according to the latter (in decreasing
order). For a given transient bit budget, a maximal number $N_{\max}$ of nodes to be retained may be
estimated; the nodes with the largest "transientness" probability $P_{jk}(T)$ are then selected, and the
corresponding transient layer is reconstructed.
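The selection step can be sketched in a few lines. The posterior probabilities are assumed to be already available (for instance from an upward-downward pass, which is not implemented here); the dictionaries, the function name and the numerical values below are purely illustrative.

```python
def select_transient_nodes(posteriors, coeffs, max_nodes):
    """Keep the max_nodes tree nodes with the largest posterior P_jk(T)."""
    ranked = sorted(posteriors, key=posteriors.get, reverse=True)
    kept = set(ranked[:max_nodes])
    # coefficients of the estimated transient layer
    layer = {node: coeffs[node] for node in kept}
    return kept, layer

# Hypothetical posteriors and coefficients on a 3-scale tree.
posteriors = {(1, 0): 0.99, (2, 0): 0.90, (2, 1): 0.20,
              (3, 0): 0.10, (3, 1): 0.80, (3, 2): 0.05, (3, 3): 0.30}
coeffs = {node: float(i + 1) for i, node in enumerate(sorted(posteriors))}
kept, layer = select_transient_nodes(posteriors, coeffs, max_nodes=3)
print(sorted(kept))                   # [(1, 0), (2, 0), (3, 1)]
```

Unlike a Viterbi-type estimate, plain ranking of posteriors fixes the number of retained nodes exactly, but it does not by itself guarantee that the selected nodes form a connected tree.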



3.3. Numerical simulations. As for the case of the tonal layer, it is easy to perform numerical
simulations of the model to evaluate the performance of the estimation algorithms. We display
in Figure 6 the results of such simulations, using the EM algorithm for parameter estimation,
and the Viterbi algorithm for hidden states estimation. As may be seen from the plots, the
significance tree and the transient layer are quite well estimated.
Again, using the posterior probability thresholding method instead of the Viterbi method
yields an approximate transient layer, and the discussion of Remark 5 still holds true.

4. The "tonal vs transient" balance 

We have described in Sections 2 and 3 two models for tonal and transient layers in audio
signals, and corresponding estimation algorithms. One of the main aspects of the latter is that
the hidden states estimation is based on thresholding of a posteriori probabilities rather than on
a global Viterbi-type estimation, which makes it possible to accommodate any bit rate prescribed
in advance.

However, as stressed in the Introduction, and described in more detail in the subsequent
section, we develop a coding approach based upon recursive estimations of tonal and transient
layers. We describe below an approach for pre-estimating the relative sizes of the tonal and
transient layers, in order to balance the bit budget between the two layers prior to estimation.
The reader interested in more details is invited to refer to [22].



4.1. Pre-estimating the "sizes" of the tonal and transient layers. Consider a signal
assumed for simplicity to be of the form (1), with unknown values of $|\Lambda|$ and $|\Delta|$. We seek
estimates for the "transientness" and "tonality" indices

(35)    $I_{ton} = \dfrac{|\Delta|}{|\Delta| + |\Lambda|}$ ,    $I_{tr} = \dfrac{|\Lambda|}{|\Delta| + |\Lambda|}$ ,

or alternatively, the proportion of the signal's energy contained in the tonal and transient layers.
For simplicity, we limit ourselves to the finite dimensional situation, and propose a procedure
very much in the spirit of the information theoretic approaches advocated by M.V. Wickerhauser
and collaborators [23 EO].

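Definition 3. Let $B = \{e_n, n \in S\}$ be an orthonormal basis of a given $N$-dimensional signal
space $\mathcal{E}$. The logarithmic dimension of $x \in \mathcal{E}$ in the basis $B$ is defined by

(36)    $\mathcal{D}_B(x) = \dfrac{1}{N} \sum_{n \in S} \log_2\big( |\langle x, e_n \rangle|^2 \big)$ .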
We aim to show that such a quantity may provide the desired estimates, under suitable
assumptions on the signal (sparsity) and the considered bases (incoherence). Elementary calculations
show that in the framework of the signal models (1), one has the following

Lemma 1. Given an orthonormal basis $B = \{e_n, n \in S\}$, assuming that the coefficients $\langle x, e_n \rangle$
of $x \in \mathcal{E}$ are $\mathcal{N}(0, \sigma_n^2)$ random variables, one has

(37)    $\mathbb{E}\{\mathcal{D}_B(x)\} = C + \dfrac{1}{N} \sum_{n \in S} \log_2(\sigma_n^2)$ ,

where $C = -(1 + \gamma/\ln 2)$ ($\gamma \approx 0.5772156649$ being Euler's constant).
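The structure of Lemma 1 (a constant offset plus the average log-variance) can be checked by simulation. The sketch below draws independent centered Gaussian coefficients with prescribed standard deviations and estimates the offset $\mathbb{E}\{\mathcal{D}_B(x)\} - \frac{1}{N}\sum_n \log_2(\sigma_n^2)$ for two different variance profiles; the profiles and function names are illustrative assumptions.

```python
import math
import random

def log_dimension(coeffs):
    """Logarithmic dimension: average of log2 |c_n|^2 over the coefficients."""
    return sum(math.log2(c * c) for c in coeffs) / len(coeffs)

def offset(sigmas, trials, rng):
    """Monte Carlo estimate of E{D_B(x)} - (1/N) sum_n log2(sigma_n^2)."""
    mean_dim = sum(
        log_dimension([rng.gauss(0.0, s) for s in sigmas]) for _ in range(trials)
    ) / trials
    mean_log_var = sum(math.log2(s * s) for s in sigmas) / len(sigmas)
    return mean_dim - mean_log_var

rng = random.Random(2)
N = 64
flat = [1.0] * N                                  # identical variances
decaying = [2.0 ** (-n / 8) for n in range(N)]    # geometrically decaying variances
c_flat = offset(flat, 2000, rng)
c_decaying = offset(decaying, 2000, rng)
print(c_flat, c_decaying)
```

Both estimates should agree up to sampling error, which illustrates that the offset does not depend on the variances and plays the role of the constant $C$ in (37).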

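Consider now the model (1), and assume that the coefficients $\alpha_\lambda, \lambda \in \Lambda$ and $\beta_\delta, \delta \in \Delta$ are
respectively $\mathcal{N}(0, \sigma_\lambda^2)$ and $\mathcal{N}(0, \tilde\sigma_\delta^2)$ independent random variables. Then the coefficients

$a_\lambda = \langle x, \psi_\lambda \rangle$ ;    $b_\delta = \langle x, w_\delta \rangle$ ,

are centered normal random variables, whose variances depend on whether $\lambda \in \Lambda$ (or $\delta \in \Delta$)
or not. For example, in the case of the $a_\lambda$ coefficients,

(38)    $\mathrm{var}\{a_\lambda\} = \begin{cases} \sigma_\lambda^2 + \sum_{\delta \in \Delta} \tilde\sigma_\delta^2\, |\langle \psi_\lambda, w_\delta \rangle|^2 & \text{if } \lambda \in \Lambda, \\ \sum_{\delta \in \Delta} \tilde\sigma_\delta^2\, |\langle \psi_\lambda, w_\delta \rangle|^2 & \text{if } \lambda \notin \Lambda, \end{cases}$

which yields

(39)    $\mathbb{E}\{\mathcal{D}_\Psi(x)\} = C + \dfrac{1}{N} \log_2 \Big[ \prod_{\lambda \in \Lambda} \Big( \sigma_\lambda^2 + \sum_{\delta \in \Delta} \tilde\sigma_\delta^2\, |\langle \psi_\lambda, w_\delta \rangle|^2 \Big) \prod_{\lambda' \notin \Lambda} \Big( \sum_{\delta \in \Delta} \tilde\sigma_\delta^2\, |\langle \psi_{\lambda'}, w_\delta \rangle|^2 \Big) \Big]$

and a similar expression for the logarithmic dimension $\mathcal{D}_W(x)$ with respect to the $W = \{w_\delta\}$
basis.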

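For the sake of simplicity, we now assume that $\sigma_\lambda = \sigma$, $\forall \lambda \in \Lambda$ and $\tilde\sigma_\delta = \tilde\sigma$, $\forall \delta \in \Delta$. Introduce
the Parseval weights

(40)    $p_\lambda(\Delta) = \sum_{\delta \in \Delta} |\langle w_\delta, \psi_\lambda \rangle|^2$ ,    $\tilde p_\delta(\Lambda) = \sum_{\lambda \in \Lambda} |\langle w_\delta, \psi_\lambda \rangle|^2$ .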
The Parseval weights provide information regarding the "dissimilarity" of the two considered
bases. The following property is a direct consequence of Parseval's formula:

Lemma 2. With the above notations, the Parseval weights satisfy

$0 \le p_\lambda(\Delta) \le 1$ ,    $0 \le \tilde p_\delta(\Lambda) \le 1$ .
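Parseval weights are simple to compute for any pair of orthonormal bases. As a small illustration, the sketch below swaps in two easily constructed bases in dimension $N$ (the canonical "spike" basis in place of wavelets, and an orthonormal DCT-II basis in place of the MDCT); these substitutions, the chosen index set and the function names are assumptions made only for the example.

```python
import math

def dct_basis(N):
    """Rows are the orthonormal DCT-II vectors w_0, ..., w_{N-1}."""
    rows = []
    for m in range(N):
        scale = math.sqrt((1.0 if m == 0 else 2.0) / N)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * m / N) for n in range(N)])
    return rows

def spike_basis(N):
    """Rows are the canonical basis vectors."""
    return [[1.0 if n == m else 0.0 for n in range(N)] for m in range(N)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def parseval_weight(psi_vec, W, Delta):
    """Sum over delta in Delta of |<w_delta, psi>|^2."""
    return sum(inner(W[d], psi_vec) ** 2 for d in Delta)

N = 16
W, Psi = dct_basis(N), spike_basis(N)
Delta = {1, 2, 5}
weights = [parseval_weight(Psi[lam], W, Delta) for lam in range(N)]
average = sum(weights) / N            # ensemble average over all lambda
print(min(weights), max(weights), average)
```

Each weight lies between 0 and 1, as stated in Lemma 2, and the average over all $\lambda$ equals $|\Delta|/N$ (here 3/16) because the basis vectors have unit norm.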

Introduce the relative redundancies of the bases $\Psi$ and $W$ with respect to the significance
maps

(41)    $\epsilon(\Delta) = \max_{\lambda \in \Lambda} p_\lambda(\Delta)$ ,    $\tilde\epsilon(\Lambda) = \max_{\delta \in \Delta} \tilde p_\delta(\Lambda)$ .

These quantities carry information similar to the one carried by the Babel function (see
comment c. below). One then obtains simple estimates for the logarithmic dimension.


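Proposition 7. With the above notations, assuming that the significant coefficients $\alpha_\lambda, \lambda \in \Lambda$
and $\beta_\delta, \delta \in \Delta$ are i.i.d. $\mathcal{N}(0, \sigma^2)$ and $\mathcal{N}(0, \tilde\sigma^2)$ normal variables respectively, one has the
following bounds

(42)    $\mathbb{E}\{\mathcal{D}_\Psi(x)\} \ge C + \dfrac{|\Lambda|}{N} \log_2(\sigma^2) + \log_2 \Big( \prod_{\lambda \notin \Lambda} \big( \tilde\sigma^2 p_\lambda(\Delta) \big)^{1/N} \Big)$

(43)    $\mathbb{E}\{\mathcal{D}_\Psi(x)\} \le C + \dfrac{|\Lambda|}{N} \log_2\big(\sigma^2 + \epsilon(\Delta)\tilde\sigma^2\big) + \log_2 \Big( \prod_{\lambda \notin \Lambda} \big( \tilde\sigma^2 p_\lambda(\Delta) \big)^{1/N} \Big)$

Exchanging the roles of $\Lambda$ and $\Delta$, a similar bound is obtained for $\mathcal{D}_W(x)$.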

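At this point, several comments have to be made.

a. The bounds in Equations (42) and (43) differ by $|\Lambda| \log_2(1 + \epsilon(\Delta)\tilde\sigma^2/\sigma^2)/N$. Let us
temporarily assume that this term may be neglected (see comment b. below for more
details). The behavior of $\mathbb{E}\{\mathcal{D}_\Psi(x)\}$ is therefore essentially controlled by

$\Big( \prod_{\lambda \notin \Lambda} p_\lambda(\Delta) \Big)^{1/N}$ .

Such an expression is not easily understood, but a first idea may be obtained by replacing
$p_\lambda(\Delta)$ by its "ensemble average"

$\dfrac{1}{N} \sum_{\lambda=1}^{N} p_\lambda(\Delta) = \dfrac{1}{N} \sum_{\lambda=1}^{N} \sum_{\delta \in \Delta} |\langle w_\delta, \psi_\lambda \rangle|^2 = \dfrac{1}{N} \sum_{\delta \in \Delta} \|w_\delta\|^2 = \dfrac{|\Delta|}{N}$ ,

which yields the approximate expression:

(44)    $\mathbb{E}\{\mathcal{D}_\Psi(x)\} \approx C + \dfrac{|\Lambda|}{N} \log_2(\sigma^2) + \Big( 1 - \dfrac{|\Lambda|}{N} \Big) \log_2 \Big( \tilde\sigma^2\, \dfrac{|\Delta|}{N} \Big)$ .

Therefore, if the "$\Psi$-component" of the signal is sparse enough, i.e. if $|\Lambda|/N$ is sufficiently
small (compared with 1), $\mathbb{E}\{\mathcal{D}_\Psi(x)\}$ may be expected to behave as $\log_2(\tilde\sigma^2 |\Delta| / N)$, which
suggests using

(45)    $N_\Psi(x) = 2^{\mathcal{D}_\Psi(x)}$

as an estimate (up to a multiplicative constant) for the "size" of the $W$ component of
the signal. Notice that this expression coincides with (2).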

b. The difference between the lower and upper bounds depends on two parameters: the
sparsity $|\Lambda|/N$ of the $\Psi$-component, and the relative redundancy parameter $\epsilon(\Delta)$. The
latter actually describes the intrinsic differences between the two considered bases. When
the bases are significantly different, the relative redundancy may be expected to be small
(notice that in any case, it is smaller than 1).

c. The relative redundancy parameters $\epsilon$ and $\tilde\epsilon$ which pop up in our model differ from the
ones which are generally considered in the literature, namely the coherence of the dictionary
$W \cup \Psi$ (see e.g. [S1[I21[I1])

$\mu[W \cup \Psi] = \sup_{b, b' \in W \cup \Psi,\ b \neq b'} |\langle b, b' \rangle|$ ,

and the Babel function (see [2Sl [S]). The latter are intrinsic to the dictionary, while
the Parseval weights and corresponding $\epsilon$ and $\tilde\epsilon$ provide finer information, as they also
account for the signal models, via their dependence on the significance maps $\Lambda$ and $\Delta$.
d. Precise estimates for $\epsilon$ and $\tilde\epsilon$ are fairly difficult to obtain. What would actually be
needed is a tractable model for the significance maps $\Lambda$ and $\Delta$, in the spirit of the
structured models described in the two previous sections (for which we could not obtain
simple estimates). Returning to the wavelet and MDCT case, it is quite natural to expect
that models implementing time persistence in $\Delta$ and scale persistence in $\Lambda$ would yield
smaller values for the relative redundancies than models featuring uniformly distributed
significance maps.

A more detailed analysis of this method (including a discussion of noise robustness issues) is 
presented in [22] . 

4.2. Numerical simulations. The above discussion suggests using the logarithmic dimensions
in order to get estimates for the relative sizes of the tonal and transient layers in audio signals.
We shall use the following estimated proportions

(46)    $\tilde I_{ton} = \dfrac{N_\Psi(x)}{N_\Psi(x) + N_W(x)}$ ;    $\tilde I_{tr} = \dfrac{N_W(x)}{N_\Psi(x) + N_W(x)}$ .

In order to validate this approach, we computed these quantities on simulated signals of the
form (1), as functions of $|\Delta|$ (resp. $|\Lambda|$) for fixed values of $|\Lambda|$ (resp. $|\Delta|$). The results of such
simulations are displayed in Figure 7, which shows $\tilde I_{ton}$ and $\tilde I_{tr}$ as functions of $|\Delta|$ or $|\Lambda|$, together
with the theoretical curves defined in (35), averaged over 20 realizations. As may be seen, the
results are fairly satisfactory, which indicates that such indicators may be used for estimating
the percentage of bit rate to be allocated to the different components, prior to the hybrid coding
itself.
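A toy version of this simulation can be written with two simple orthonormal bases. The sketch below is only an illustration under stated assumptions: the canonical "spike" basis stands in for the wavelet basis and an orthonormal DCT-II basis for the MDCT, the significant coefficients have unit variance, and the indices are obtained by normalizing $2^{\mathcal{D}}$ computed in each basis. All names and parameter values are hypothetical.

```python
import math
import random

def dct_basis(N):
    """Rows are the orthonormal DCT-II vectors."""
    rows = []
    for m in range(N):
        scale = math.sqrt((1.0 if m == 0 else 2.0) / N)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * m / N) for n in range(N)])
    return rows

def log_dimension(coeffs):
    return sum(math.log2(c * c) for c in coeffs) / len(coeffs)

def indices(x, W):
    """Return the estimated (tonality, transientness) indices of x."""
    dct = [sum(w * v for w, v in zip(row, x)) for row in W]
    n_psi = 2.0 ** log_dimension(x)      # spike-basis coefficients: tracks the tonal size
    n_w = 2.0 ** log_dimension(dct)      # DCT coefficients: tracks the transient size
    return n_psi / (n_psi + n_w), n_w / (n_psi + n_w)

def simulate(N, n_tr, n_ton, W, rng):
    """Signal with n_tr random spikes and n_ton random DCT components."""
    x = [0.0] * N
    for lam in rng.sample(range(N), n_tr):
        x[lam] += rng.gauss(0.0, 1.0)
    for delta in rng.sample(range(N), n_ton):
        beta = rng.gauss(0.0, 1.0)
        for n in range(N):
            x[n] += beta * W[delta][n]
    return x

def mean_tonality(N, n_tr, n_ton, W, rng, trials=100):
    return sum(indices(simulate(N, n_tr, n_ton, W, rng), W)[0]
               for _ in range(trials)) / trials

rng = random.Random(3)
N = 64
W = dct_basis(N)
mostly_tonal = mean_tonality(N, n_tr=4, n_ton=16, W=W, rng=rng)
mostly_transient = mean_tonality(N, n_tr=16, n_ton=4, W=W, rng=rng)
print(mostly_tonal, mostly_transient)
```

The averaged tonality index should be noticeably larger when the DCT part carries more significant coefficients than the spike part, and smaller in the opposite configuration.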



Our numerical results using wavelet and MDCT bases suggest that these numbers are generally of the order of 
1/4: any waveform from a given basis always finds a waveform from the other basis which "looks like it". 
Figure 7. Simulations of tonality and transientness indices, as functions of $|\Delta|$
or $|\Lambda|$ (time frames of 1024 samples): theoretical curves and simulation (averaged
over 10 realizations); top plot: $|\Lambda| = 40$, varying $|\Delta|$, $I_{tr}$ (decreasing curve) and
$I_{ton}$ (increasing curve); bottom plot: $|\Delta| = 40$, varying $|\Lambda|$, $I_{ton}$ (decreasing curve)
and $I_{tr}$ (increasing curve).
Figure 8. Tonal vs transient balance for a real audio signal (a musical signal).
Left: long signal (about 23 seconds long): top plot: original signal; bottom plot:
transientness index. Right: shorter (1.5 seconds long) segment, same legend.



An example on a real audio signal is displayed in Figure 8, which represents the transientness
index (from which the tonality index is easily deduced) for a segment (about 23 seconds) of audio
signal (the mamavatu signal, which will be used again as illustration in the next section). A
shorter segment of 1.5 seconds (located in the middle of the long segment) is analyzed similarly
in the right hand plots of Figure 8. As may be seen, the transientness index (lower curves)
exhibits significant local maxima in the neighborhood of the various "attacks" of the signal (see
the left hand plots of Figure 8). Notice also on the right hand plots of Figure 8 that the
transientness index exhibits an overall decay in the rightmost part of the plot. This is mainly due
to the fact that a significant tonal component shows up in that part of the signal (see Figure 10
in the next section), which reduces the proportion of transients (we recall that the transientness
index really measures the proportion, and not the quantity of transient signal present).

Remark 6. It is worth noticing that the indices $I_{ton}$ and $I_{tr}$ perform satisfactorily as long as
the two expansions in (1) are sparse enough. Otherwise, deviations from the "ideal" behavior
have to be expected, as may be seen in the right hand side of the plots in Figure 7.

Remark 7. Also, $I_{ton}$ and $I_{tr}$ provide estimates for the sizes of significance maps only when
the variances $\sigma^2$ and $\tilde\sigma^2$ are of comparable magnitude. When this is not the case, it is easily
seen that they rather provide estimates of the relative energies of the two layers, for example
$I_{tr} = |\Lambda|\sigma^2 / (|\Lambda|\sigma^2 + |\Delta|\tilde\sigma^2)$. The behavior of the indices in noisy situations (i.e. with small,
additive white noise) may be studied as well, and yields similar conclusions, as long as the noise's
energy is small enough [22].

5. Application to audio coding 

The ideas developed above are currently being implemented within a prototype hybrid audio 
coder, extending the ideas already described in [7]. While the idea of hybrid coding of audio 
signals is not new, our approach is the first one that implements hybrid transform coding without 
prior (time) segmentation of the signal. A detailed account of the coding system will be given 
in a forthcoming publication, together with systematic performance evaluations. However, we 



Available at the web site
http://www.cmi.univ-mrs.fr/~torresan/papers/Markov
find it interesting to sketch the main features here, as they provide a thorough application of
the probabilistic models we just described.
Figure 9. Block diagram of the hybrid audio coding scheme

5.1. Description of a prototype coder. The block diagram of the encoder is displayed in
Figure 9 (the corresponding decoder simply amounts to inverting the MDCT and DWT transforms
from the encoded coefficients). The first step of the algorithm is a pre-estimation of the relative
sizes of the tonal and transient layers, according to the discussion of Section 4. Hence, any given
bit budget may be allocated a priori to the different layers of the signal.

The second step is the estimation of the (structured) tonal layer, according to Section 2. The
parameters of the hidden Markov models are estimated and updated on large time frames, and


9 



DWT: Discrete Wavelet Transform. 



An Hybrid Audio Scheme using Hidden Markov Models of Waveforms. submitted to ACHA, 35 

the hidden states are estimated by thresholding of a posteriori probability. This yields estimated
tonal and non-tonal layers

$x_{ton} = \sum_{\delta \in \Delta} \langle x, w_\delta \rangle w_\delta$ ,    $x_{nton} = x - x_{ton}$ .

The tonal layer is then quantized and encoded using standard techniques (either uniform
quantization, or Lloyd-Max quantization for Gaussian sources, followed by entropy coding), while
the non-tonal layer is transmitted to the transient layer estimator. Since the parameters of the
model (i.e. the persistence probabilities) provide explicitly the probabilities of lengths of "tonal
structures", the corresponding Huffman code is readily obtained, and used for encoding the
significance map.

The third step is the estimation of the transient layer from the non-tonal component. Again,
transform coding is computed within time frames of about 23 milliseconds. The parameters
of the hidden Markov model are estimated, and updated on larger time frames. Hidden states
(i.e. the significance map) are estimated within each (small) time frame by thresholding of a
posteriori state probability. Once the transient layer $x_{tr}$ has been estimated, it is subtracted
from the non-tonal signal to yield the residual; in parallel, the coefficients are quantized and
entropy coded. The tree structure of the transient significance map makes it possible to derive an
efficient way of encoding it (see [21]).

$x_{tr} = \sum_{\lambda \in \Lambda} \langle x_{nton}, \psi_\lambda \rangle \psi_\lambda$ ,    $x_{res} = x_{nton} - x_{tr}$ .

The residual is finally modeled as a (locally) stationary random process, and currently encoded 
as such using fairly classical LPC procedures (even though this might not be the optimal solution 
for very low bit rate; this subject is currently under study.) 

Notice that while the encoding procedure is quite complex (involving fairly sophisticated 
estimation algorithms), the decoding is extremely simple. The tonal and transient layers are 
reconstructed on the basis of their significance maps and corresponding encoded coefficients. 
The residual is re-generated using LPC technique. 
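The layer-peeling structure of the encoder (tonal layer first, then transient layer from the non-tonal part, then residual) can be illustrated with a deliberately simplified sketch. It replaces the hidden Markov estimation by plain selection of the largest coefficients, the MDCT by an orthonormal DCT-II and the wavelet transform by an orthonormal Haar basis, and it ignores quantization and entropy coding; all of these substitutions and all names are assumptions made for the example.

```python
import math

def dct_basis(N):
    rows = []
    for m in range(N):
        scale = math.sqrt((1.0 if m == 0 else 2.0) / N)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * m / N) for n in range(N)])
    return rows

def haar_basis(N):
    """Rows are the orthonormal Haar vectors (N must be a power of two)."""
    rows = [[1.0 / math.sqrt(N)] * N]
    length = N
    while length >= 2:
        amp, half = 1.0 / math.sqrt(length), length // 2
        for start in range(0, N, length):
            row = [0.0] * N
            for n in range(start, start + half):
                row[n] = amp
            for n in range(start + half, start + length):
                row[n] = -amp
            rows.append(row)
        length //= 2
    return rows

def analyze(B, x):
    return [sum(b * v for b, v in zip(row, x)) for row in B]

def synthesize(B, c):
    return [sum(c[m] * B[m][n] for m in range(len(B))) for n in range(len(B[0]))]

def keep_largest(coeffs, n_keep):
    """Zero all but the n_keep largest coefficients (in magnitude)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:n_keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def hybrid_layers(x, n_ton, n_tr):
    """Split x into tonal, transient and residual layers."""
    W, Psi = dct_basis(len(x)), haar_basis(len(x))
    x_ton = synthesize(W, keep_largest(analyze(W, x), n_ton))
    x_nton = [a - b for a, b in zip(x, x_ton)]
    x_tr = synthesize(Psi, keep_largest(analyze(Psi, x_nton), n_tr))
    x_res = [a - b for a, b in zip(x_nton, x_tr)]
    return x_ton, x_tr, x_res

N = 32
x = [3.0 * v for v in dct_basis(N)[5]]   # one "tonal" component
x[10] += 2.0                             # one "transient" spike
x_ton, x_tr, x_res = hybrid_layers(x, n_ton=1, n_tr=2)
```

By construction the three layers add up to the original signal, and the decoder only needs the retained coefficients and their positions to rebuild the tonal and transient layers.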
Figure 10. Compressed hybrid expansion of a piece of music (mamavatu, about
23 seconds long). From top to bottom, and from left to right: original signal, tonal
layer, nontonal signal, transient layer, residual layer, and reconstruction from the
three layers.

5.2. Numerical illustrations. An example of hybrid (or multilayered) signal expansion
obtained using the technique described in this paper is shown in Figure 10 (see Figure 8 for
the corresponding transientness index). In that example 6% of coefficients were retained (no
coefficient quantization was done, so this essentially represents only the "functional" part of the
compression).

To demonstrate the ability of the proposed procedure to yield good signal approximations,
we display in Figure 11 a comparison of Signal to Noise Ratio (SNR) curves for various encoding
Figure 11. Comparison of "functional" Signal to Noise Ratios for various coding
techniques, computed on the same test signal. Two configurations are represented:
in the left-hand plot, coefficients were selected without prior normalization in the
N-term MDCT (x), DWT (o) and hybrid (o) methods, and a was equivalently set
to 0 in the Markov method (□, see Remark 4). In the right-hand plot, MDCT
coefficients were normalized by a linear function of the frequency in the N-term
MDCT and hybrid methods (no change in the wavelet method), and a was set to
1 in the Markov method.



techniques, namely classical MDCT and DWT transform coding, as well as hybrid coding as
proposed in [7] and the approach described in this paper. As may be seen, the performances of
the different approaches are essentially comparable, the standard DWT and MDCT transform
coding techniques (in the left curve) being better by a few dB. The N-term hybrid method
appears to be slightly better than the Markov one. This is not really surprising, as the SNR is
computed from the $L^2$ distortion, and the introduction of structures in the approximation cannot
improve the $L^2$ distortion in comparison with simple coefficient thresholding. Nevertheless, we
notice that this effect is a very weak one.

For the same reason, introducing a normalization on MDCT coefficients prior to the selection
(right-hand part of Figure 11) strongly penalizes the N-term MDCT method in terms of $L^2$
distortion. Interestingly enough, this does not seem to be the case for the two hybrid methods.

One main outcome of the proposed method is that it provides a decomposition of the signal
into several layers (two, or three if the residual is considered), i.e. one has to encode several
signals instead of one. Nevertheless, Figure 11 (right) shows that in terms of SNR, this
approach remains comparable with a standard DWT approximation scheme. In addition, the
significance maps are encoded much more efficiently in the Markov approach (see Proposition 3
and Corollary 2).

Remark 8. It is worth pointing out an important aspect of audio signal coding. It is well
known that the $L^2$ distortion is not an adequate distortion measure for audio signals: it does
not take into account the variation of the hearing threshold as a function of frequency (which
could be done by an appropriate weighting of the $L^2$ norm in the frequency domain), nor the
nonlinear masking effects, which are extremely important from the perceptual point of view (see
for example [23] for a review). The distortion measure that arises naturally
in the present scheme is the likelihood, which takes into account the structures proposed in the
model. For that reason, we believe that such a distortion measure is more "natural" from a
perceptual point of view, though we do not have clear evidence yet.

To illustrate the latter remark, the interested reader may find on the companion web site of
this paper (see footnote 8) further illustrations of the behaviour of the proposed coding technique
on real audio signals, as well as corresponding sound examples.

6. Conclusions and perspectives 

This work was based on the belief that efficient signal modeling cannot be based solely on con- 
siderations on individual coefficients on a well chosen basis (even though one generally tries to use 
"bases that decorrelate" ) and may be seen as an attempt to systematically exploit "structures" , 
or "persistence properties" in the coefficient domain. In this respect, the main contributions of 
this article are the new hybrid model we propose, and the a priori rate estimations which may 
be deduced from it, thanks to the relative simplicity of the model (first-order Markov chains,
and Gaussian distributions).

We have especially emphasized in Section 5 the application to audio coding and compression.
More details on the current implementation of the codec will be published elsewhere [0] together
with a more complete analysis of quantization issues, and more detailed numerical results.
Further developments involve designing coefficient quantization procedures specifically adapted
to the tonal and transient layers, as well as the implementation of adapted masking methods
(frequency masking and time masking).

Finally, we would also like to point out that compression is far from being the only application 
of such models, and that coding is not limited to compression applications. An efficient coding 
scheme, such as the one we propose, should also prove useful for various applications such as 
automatic music transcription (exploiting the tonal layer) and onset detection (exploiting the 
transient layer). The multilayered signal representation could also simplify other audio pro- 
cessing tasks such as time-stretching, or various other signal modification problems, which may 
thus be performed directly in the coefficient domain. Among the other potential applications 
of such techniques, let us also mention blind source separation, which we plan to investigate in 
the future. 